ggml: optimize ggml_vec_dot_mxfp4_q8_0 dot product on ARM SVE #19171
jiangshhh wants to merge 1 commit into ggml-org:master
Conversation
@ggerganov @slaren The PR introduces an ARM SVE optimization for ggml_vec_dot_mxfp4_q8_0. This is my first PR to llama.cpp, so I would like to check whether there are any additional steps I should follow for the review. Thank you very much for your time and for maintaining this project.
@Alcpz By any chance do you have ARM SVE hardware to test and review this? :)
Unfortunately no, I would be happy to help otherwise.
I've just spun up an AWS Graviton4 (Neoverse V2) instance to compare master against this PR:

master
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | pp512 | 49.33 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | tg128 | 29.12 ± 0.01 |
build: f9bd518 (7955)
pr/19171
$ build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf
| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | pp512 | 47.07 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CPU | 8 | tg128 | 28.82 ± 0.06 |
build: 18ad28c (7870)
gcc dump
$ echo | gcc -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme -dM -E - | grep __ARM_FEATURE_SVE
#define __ARM_FEATURE_SVE_BITS 0
#define __ARM_FEATURE_SVE_VECTOR_OPERATORS 1
#define __ARM_FEATURE_SVE2_AES 1
#define __ARM_FEATURE_SVE 1
#define __ARM_FEATURE_SVE2_SHA3 1
#define __ARM_FEATURE_SVE_MATMUL_INT8 1
#define __ARM_FEATURE_SVE_BF16 1
#define __ARM_FEATURE_SVE2 1
#define __ARM_FEATURE_SVE2_BITPERM 1
lscpu
$ lscpu | grep -E "Model name|Flags"
Model name: Neoverse-V2
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
@taronaeo Regarding the SVE optimization for mxfp4, we initially observed approximately a 2x performance improvement on FX700 (A64FX), while no significant speedup was observed on Graviton4 (Neoverse V2). This behavior is consistent with the underlying SIMD microarchitecture: as summarized below, there are some differences across the following architectures.
A64FX (FX700)
Here, SVE provides a clear width and throughput advantage over NEON.

Neoverse V2 (Graviton4/NVIDIA Grace)
In this case, the effective SIMD throughput of SVE and NEON is architecturally equivalent. Although SVE provides a more flexible programming model, the raw vector width and pipeline count are effectively the same as NEON.

Additional Measurement on NVIDIA Grace

After the latest refinement of the implementation, we re-measured performance on NVIDIA Grace (Neoverse V2) using llama-bench (8 threads, 512 prompt tokens, 128 generation tokens, 5 repetitions).
After (PR build 7957)
Summary

On A64FX (512-bit SVE), SVE has a clear hardware throughput advantage over NEON, so a roughly 2x speedup is observed. On Neoverse V2, the effective throughput of SVE and NEON is equivalent, so no large speedup is expected.

I hope this clarifies the architectural reason behind the observed performance differences. Thank you again for the valuable feedback.
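For intuition, here is a rough per-cycle SIMD-width comparison. The pipeline figures are my assumption based on commonly published microarchitecture descriptions (2 × 512-bit SVE pipelines on A64FX, where NEON executes at 128 bits, versus 4 × 128-bit SVE2/NEON pipelines on Neoverse V2); they are not numbers taken from this thread:

$$
\text{A64FX: } \frac{2 \times 512\ \text{bit (SVE)}}{2 \times 128\ \text{bit (NEON)}} = 4\times
\qquad
\text{Neoverse V2: } \frac{4 \times 128\ \text{bit (SVE2)}}{4 \times 128\ \text{bit (NEON)}} = 1\times
$$

Under these assumptions, A64FX has up to a 4x theoretical width advantage for SVE (roughly 2x of which shows up end to end in the measurements above), while Neoverse V2 has none, which matches the benchmark results.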
Proposal
This proposal introduces an ARM SVE-optimized implementation of ggml_vec_dot_mxfp4_q8_0 for the ggml/llama.cpp CPU backend.
The current implementation relies on scalar or NEON-based code paths, which do not fully utilize the wide vector capabilities available on modern ARM CPUs equipped with the Scalable Vector Extension (SVE). By leveraging SVE intrinsics, this proposal aims to make better use of the available vector width on such CPUs.
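For readers unfamiliar with the formats involved, the sketch below illustrates the semantics this kernel computes. The struct layouts, field names, helper functions, and the nibble-to-element mapping are assumptions made for this illustration, not the actual ggml definitions (ggml stores the Q8_0 scale as fp16 and accumulates via an integer lookup table; a plain float scale and a float FP4 table are used here for brevity):

```c
#include <math.h>
#include <stdint.h>

// Illustrative reference semantics only -- layouts and names are assumptions,
// not the ggml definitions. An MXFP4 block stores 32 values as a shared E8M0
// exponent byte plus 16 bytes of packed 4-bit FP4 (E2M1) codes; a Q8_0 block
// stores 32 int8 values with a per-block scale.
#define BLK 32

typedef struct { uint8_t e; uint8_t qs[BLK/2]; } blk_mxfp4; // hypothetical layout
typedef struct { float   d; int8_t  qs[BLK];   } blk_q8_0;  // hypothetical layout

// FP4 E2M1 code -> value (top bit of the 4-bit code is the sign)
static const float fp4_lut[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

// E8M0 scale: an unsigned exponent byte encoding 2^(e - 127)
static float e8m0_to_fp32(uint8_t e) { return ldexpf(1.0f, (int) e - 127); }

// n must be a multiple of BLK; low nibbles are assumed to map to the first
// 16 elements of a block and high nibbles to the last 16.
static float vec_dot_mxfp4_q8_0_ref(int n, const blk_mxfp4 *x, const blk_q8_0 *y) {
    float sum = 0.0f;
    for (int ib = 0; ib < n / BLK; ++ib) {
        float acc = 0.0f;
        for (int j = 0; j < BLK/2; ++j) {
            const uint8_t q = x[ib].qs[j];
            acc += fp4_lut[q & 0x0F] * y[ib].qs[j];
            acc += fp4_lut[q >>   4] * y[ib].qs[j + BLK/2];
        }
        sum += acc * e8m0_to_fp32(x[ib].e) * y[ib].d;
    }
    return sum;
}
```

The key point for a vectorized port is that the two per-block scales (the E8M0 exponent and the Q8_0 scale) are applied once per block, outside the inner accumulation loop.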
Verifying Features
The proposed SVE implementation was verified with the following considerations:
- Accumulation logic and scaling factors follow the original ggml_vec_dot_mxfp4_q8_0 definition.
- The implementation uses SVE intrinsics only, without assuming a fixed vector length (see the sketch after this list).
- The SVE path is guarded by __ARM_FEATURE_SVE to ensure it is executed only on supported hardware.
- Non-SVE platforms continue to use the existing scalar or NEON implementations without modification.
- The change does not affect other quantization paths.
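To make the vector-length-agnostic and feature-guard points concrete, here is a minimal sketch of that pattern applied to a plain int8 dot product; it is an illustration only, not the actual mxfp4 kernel from this PR:

```c
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#include <stdint.h>

// Minimal vector-length-agnostic pattern: advance by svcntb() bytes per
// iteration and predicate the tail with svwhilelt, so the same code runs
// unchanged on 128-bit (Neoverse V2) and 512-bit (A64FX) SVE hardware.
static int64_t dot_s8_sve(const int8_t *x, const int8_t *y, int n) {
    svint32_t acc = svdup_n_s32(0);
    const int step = (int) svcntb();                 // bytes per SVE vector
    for (int i = 0; i < n; i += step) {
        const svbool_t pg = svwhilelt_b8_s32(i, n);  // active lanes for this chunk
        const svint8_t vx = svld1_s8(pg, x + i);     // inactive lanes read as zero
        const svint8_t vy = svld1_s8(pg, y + i);
        acc = svdot_s32(acc, vx, vy);                // 4-way int8 -> int32 dot product
    }
    return svaddv_s32(svptrue_b32(), acc);           // horizontal reduction
}
#endif // __ARM_FEATURE_SVE
```

Because nothing in the loop assumes a particular vector width, the same source covers both the A64FX and Neoverse V2 cases discussed above, and non-SVE builds never see this code thanks to the preprocessor guard.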
Performance check
The performance was measured on an FX700 (A64FX).
Performance improves as follows (values are tokens per second).
The command used to measure performance:
llama-batched --model ${PATH_TO_MODEL} --prompt 'AI is going to' --parallel 8 --predict 128 --seed 0 --threads 48